
Batch main sumcheck across chips #1333

Open
hero78119 wants to merge 51 commits into master from feat/batch_main_sumcheck

Conversation

@hero78119 (Collaborator) commented Apr 29, 2026

Problem

Main sumcheck was proved and verified per chip, which duplicated transcript work, selector/claim handling, and PCS opening plumbing across chips.

Design Rationale

Use one global batched main sumcheck proof while keeping PCS openings in the existing suffix path. The verifier mirrors the prover's transcript order, sampling the ECC bridge challenge before the global combine subset-evals challenge, and evaluates the frontloaded expressions itself.
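The batching idea can be sketched abstractly: prover and verifier derive one challenge from the shared transcript and fold all per-chip claims into a single global claim via a random linear combination, so only one sumcheck is run. A minimal sketch over a toy prime field (the name `combine_claims` and the field arithmetic are illustrative, not the actual ceno_zkvm API):

```rust
// Fold per-chip sumcheck claims into one batched claim:
//   batched = sum_i alpha^i * claim_i  (mod p)
// Both sides derive the same `alpha` from the shared transcript, so they
// agree on the single batched claim without per-chip transcript work.
fn combine_claims(claims: &[u64], alpha: u64, modulus: u64) -> u64 {
    let mut acc = 0u64;
    let mut pow = 1u64; // alpha^i
    for &c in claims {
        acc = (acc + pow * c % modulus) % modulus;
        pow = pow * alpha % modulus;
    }
    acc
}

fn main() {
    // Three toy per-chip claims, challenge alpha = 11, field modulus p = 97:
    // 3 + 11*5 + 11^2*7 mod 97 = 32
    assert_eq!(combine_claims(&[3, 5, 7], 11, 97), 32);
    println!("batched claim = {}", combine_claims(&[3, 5, 7], 11, 97));
}
```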

Change Highlights

  • ceno_zkvm: batches main constraints into a single global proof path across chip proofs.
  • ceno_zkvm: keeps witness/fixed PCS openings per chip after global main verification.
  • ceno_recursion: mirrors native verifier changes for the batched main proof.
  • ceno-gpu: supports the batched main proving flow.

Benchmark / Performance Impact

The benchmark session compares the frontload baseline against successive feat/batch_main_sumcheck optimization runs on block 23817600, with GPU proving and CENO_GPU_ENABLE_WITGEN=0.

Comparison convention: lower time is better. Signed x values use -Nx for slower-than-baseline wall time and +Nx for faster/lower-time metrics; for example, taking twice as long is -2.00x.
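A small helper makes the convention concrete (hypothetical illustration, not part of the benchmark tooling):

```rust
// Signed ratio convention used in the tables below: lower time is better,
// so a run slower than baseline is reported as -Nx and a faster run as +Nx.
fn signed_ratio(baseline_s: f64, current_s: f64) -> f64 {
    if current_s > baseline_s {
        -(current_s / baseline_s) // regression: e.g. 75.6s -> 91.8s is about -1.21x
    } else {
        baseline_s / current_s // improvement / lower-time metric
    }
}

fn main() {
    // E2E total from the tables: baseline 75.600s vs latest 91.800s
    assert!((signed_ratio(75.6, 91.8) + 91.8 / 75.6).abs() < 1e-12);
    println!("{:.2}x", signed_ratio(75.6, 91.8)); // prints -1.21x
}
```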

Timeline / Optimization Progress

| Date | Run / Job | Ceno / GPU Commit | E2E | vs Baseline | app_prove | vs Baseline | prove_batched_main_constraints | Short Highlight |
|---|---|---|---|---|---|---|---|---|
| May 6 | 25419833788 / job 74559223217 | Ceno 7a07649b, GPU 1118dca8 | 75.600s | Baseline | 61.000s | Baseline | 0.000s | Baseline: frontload, per-chip main constraints |
| May 9 AM | 25594090744 / job 75136918384 | Ceno dd229c00, GPU 340651b4 | 103.000s | -1.36x | 87.400s | -1.43x | 0.000s | Batched branch after alpha.28 upgrade; tower/extract totals much lower but wall time regressed |
| May 9 PM | 25603601935 / job 75161599043 | Ceno d5ae1b3a, GPU fbef26f3 | 104.000s | -1.38x | 88.300s | -1.45x | 26.925s | Batched main proof enabled; new batched-main critical path dominates |
| May 11 | 25655529702 / job 75302942526 | Ceno c2c45cc9, GPU 3dedbc78 | 91.800s | -1.21x | 76.500s | -1.25x | 15.457s | Latest optimization: direct batched-main construction + bucketed fold/eval GPU sumcheck |

E2E / Layer

| Metric | Baseline | Latest Optimization | Comparison |
|---|---|---|---|
| E2E total | 75.600s | 91.800s | -1.21x |
| emulator | 10.100s | 10.200s | -1.01x |
| app_prove wall time | 61.000s | 76.500s | -1.25x |

App Prove Breakdown

Profiler module totals can overlap because chip proving is concurrent; use app_prove wall time above for critical-path impact. The latest run materially reduces the new batched-main cost, but total wall time is still slower than the frontload baseline.

| Operation | Baseline | Batched May 9 AM | Batched May 9 PM | Latest May 11 | Latest vs Baseline |
|---|---|---|---|---|---|
| prove_batched_main_constraints | 0.000s | 0.000s | 26.925s | 15.457s | New cost |
| prove_main_constraints | 22.622s | 0.000s | 0.000s | 0.000s | Removed |
| extract_witness_mles | 24.155s | 3.760s | 3.713s | 3.739s | +6.46x |
| build_tower_witness_gpu | 3.491s | 0.323s | 0.316s | 0.323s | +10.81x |
| prove_tower_relation_gpu | 176.090s | 24.008s | 24.417s | 24.857s | +7.08x |
| pcs_opening | 15.246s | 15.207s | 15.164s | 15.175s | +1.00x |
| commit_traces | 6.827s | 6.814s | 6.851s | 6.857s | -1.00x |
| parsed rows total | 251.118s | 50.995s | 78.287s | 67.460s | +3.72x |
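The overlap caveat can be demonstrated with a toy concurrency sketch (not ceno code): when chip proofs run concurrently and each logs its own elapsed time, the summed module total exceeds the wall time of the critical path.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Toy model of concurrent chip proving: each "chip" sleeps ~100ms and logs
// its own elapsed time. The parsed module total is the sum of those logs
// (>= 200ms), while on a multicore machine the wall time stays near 100ms,
// so module totals are not a wall-time decomposition.
fn simulate_concurrent_proving(n_chips: usize, each_ms: u64) -> (Duration, Duration) {
    let start = Instant::now();
    let handles: Vec<_> = (0..n_chips)
        .map(|_| {
            thread::spawn(move || {
                let t = Instant::now();
                thread::sleep(Duration::from_millis(each_ms)); // simulated chip proof
                t.elapsed()
            })
        })
        .collect();
    let module_total: Duration = handles.into_iter().map(|h| h.join().unwrap()).sum();
    let wall = start.elapsed();
    (module_total, wall)
}

fn main() {
    let (module_total, wall) = simulate_concurrent_proving(2, 100);
    // The summed per-chip logs are guaranteed to be at least 200ms.
    assert!(module_total >= Duration::from_millis(200));
    println!("module_total={:?} wall={:?}", module_total, wall);
}
```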

Latest Improvement Against Previous Batched Run

| Metric | May 9 PM Batched Main | May 11 Latest | Improvement |
|---|---|---|---|
| E2E total | 104.000s | 91.800s | +1.13x |
| app_prove wall time | 88.300s | 76.500s | +1.15x |
| prove_batched_main_constraints | 26.925s | 15.457s | +1.74x |
| parsed rows total | 78.287s | 67.460s | +1.16x |

Benchmark command:

```shell
CENO_GPU_ENABLE_WITGEN=0 CENO_CONCURRENT_CHIP_PROVING=1 CENO_GPU_CACHE_LEVEL=0 \
RUSTFLAGS="-C target-feature=+avx2" \
cargo run --features "jemalloc,gpu" --release --bin ceno-reth-benchmark-bin -- \
  --mode prove-app --block-number 23817600 --rpc-url <redacted> \
  --output-dir output --cache-dir rpc-cache
```

Environment:

Summary: latest optimization improves prove_batched_main_constraints by +1.74x against the previous batched-main run (26.925s -> 15.457s) and improves E2E by +1.13x (104.000s -> 91.800s). It remains slower than the frontload baseline (75.600s -> 91.800s, -1.21x), with the remaining gap concentrated in the new batched-main critical path.

Testing

```shell
RUST_MIN_STACK=33554432 cargo check --package ceno_recursion --bin e2e_aggregate
RUST_MIN_STACK=33554432 cargo run --release --package ceno_recursion --bin e2e_aggregate -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
```

Also passed the linked GPU e2e benchmark run above.

Risks and Rollout

  • Soundness risk is concentrated in transcript ordering and verifier frontload evaluation; native and recursion verifiers now follow the same global proof flow.
  • Performance is not yet an E2E win in the linked benchmark despite removing per-chip main-constraint cost; further scheduling/host-overlap work is needed before rollout as a performance improvement.

Follow-ups

  • Investigate reducing the new prove_batched_main_constraints critical-path cost.
  • Keep benchmark summaries explicit that parsed module totals overlap and are not a wall-time decomposition.

Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply .github/copilot-instructions.md strictly.

@hero78119 hero78119 marked this pull request as draft April 29, 2026 13:52
Base automatically changed from feat/prover_mle_zero_padding to master May 4, 2026 07:55
@hero78119 hero78119 changed the title batch main sumcheck Batch main sumcheck across chips May 9, 2026
@hero78119 hero78119 marked this pull request as ready for review May 9, 2026 12:22
hero78119 added 2 commits May 9, 2026 21:06
Build batched main sumcheck virtual polynomials directly from monomial terms instead of reconstructing a large Expression tree and monomializing it again. This removes expensive expression rebuild work on CPU proof generation while preserving proof semantics. Also extend integration timeout to allow the existing slow batched proof path to complete after increasing stack size.
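The direct monomial construction described in this commit can be sketched roughly as follows (the types are hypothetical, not the actual ceno_zkvm structures): keep the batched virtual polynomial as a flat list of monomial terms and evaluate it directly, rather than rebuilding a large Expression tree and monomializing it a second time.

```rust
// Hypothetical flat monomial representation of a batched virtual polynomial.
#[derive(Debug)]
struct Monomial {
    coeff: i64,
    var_indices: Vec<usize>, // indices into the shared evaluation point
}

// Evaluate sum_t coeff_t * prod_j point[var_indices_t[j]] directly from the
// monomial list, with no intermediate expression-tree rebuild.
fn eval_monomials(terms: &[Monomial], point: &[i64]) -> i64 {
    terms
        .iter()
        .map(|t| t.coeff * t.var_indices.iter().map(|&i| point[i]).product::<i64>())
        .sum()
}

fn main() {
    // 2*x0*x1 + 3*x2 at (1, 2, 3) = 4 + 9 = 13
    let terms = [
        Monomial { coeff: 2, var_indices: vec![0, 1] },
        Monomial { coeff: 3, var_indices: vec![2] },
    ];
    assert_eq!(eval_monomials(&terms, &[1, 2, 3]), 13);
}
```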